How it works

The integration of Solr indexing with Texpress provides an alternative but compatible way of searching for records. In this section the integration is examined in detail, providing useful information for those who would like to interact with the Solr indexes directly. Such interaction may be useful for web-based systems.

When the schema for a Texpress table is saved using texdesign, the equivalent Solr schema is generated. The Texpress schema is located in the file:

data/table/ins

which contains the Insertion Form; and the Solr schema is found in the file:

solr/table/solr/conf/schema.xml

The Solr schema is in two sections. The first contains the field types used to define the indexing types to apply to fields. The field types are enclosed within the <fieldType> XML tag. For example:

<fieldType name='string' class='solr.StrField' stored='false' />

<fieldType name='strings' class='solr.StrField' stored='false' multiValued='true' />

<fieldType name='int' class='solr.LongPointField' stored='false' />

<fieldType name='ints' class='solr.LongPointField' stored='false' multiValued='true' />

The second section lists the field indexes, which are defined using the <field> XML tag. For example:

<field name='irn_value' type='value' />

<field name='SummaryData_text' type='text' />

<field name='SummaryData_phonetic' type='phonetic' />

<field name='ExtendedData_text' type='text' />

The Solr schema file is generated automatically and must not be altered by hand. It is very important that the Texpress schema and the Solr schema are in sync.
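
For third-party tools that need to know which Solr indexes exist, the generated schema can be read (but never written) programmatically. The following is a minimal sketch in Python, assuming the file contains <fieldType> and <field> elements as in the examples above. The function name list_schema and the sample XML fragment are illustrative only and are not part of Texpress or EMu.

```python
import xml.etree.ElementTree as ET


def list_schema(xml_text):
    """Return (field_types, fields) parsed from Solr schema XML text.

    field_types maps each <fieldType> name to True if it is multi-valued;
    fields maps each <field> name to the name of its field type.
    """
    root = ET.fromstring(xml_text)
    field_types = {
        ft.get("name"): ft.get("multiValued") == "true"
        for ft in root.iter("fieldType")
    }
    fields = {f.get("name"): f.get("type") for f in root.iter("field")}
    return field_types, fields


# Illustrative fragment only; a real schema.xml holds many more entries.
SAMPLE = """
<schema name='example' version='1.6'>
  <fieldType name='string' class='solr.StrField' stored='false' />
  <fieldType name='strings' class='solr.StrField' stored='false' multiValued='true' />
  <field name='irn_value' type='value' />
  <field name='SummaryData_text' type='text' />
</schema>
"""

if __name__ == "__main__":
    types, fields = list_schema(SAMPLE)
    print(types)
    print(fields)
```

To inspect a real table, the contents of its schema.xml file would be read and passed to list_schema in place of SAMPLE. The sketch only reads the schema, in keeping with the rule that the file must not be altered by hand.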

The Solr field types declare the type of index to be built for a given data type. Each field type maps onto a Texpress index type. The table below shows the mapping between the Solr field type and the corresponding Texpress index type:

Solr field type   Texpress indexing type   Comments
text              HASH (word)              Word-based index where each word and symbol is indexed.
texts             HASH (word)              Same as text except used for multi-value text fields.
phonetic          PHONETIC (word)          Word-based index where the phonetic (sounds like) of each word and symbol is indexed.
phonetics         PHONETIC (word)          Same as phonetic except used for multi-value text fields.
stem              STEM (word)              Word-based index where the stem (base word) of each word and symbol is indexed.
stems             STEM (word)              Same as stem except used for multi-value text fields.
value             RANGE                    The exact value for range and tuple-based fields. Range fields are date, time, latitude and longitude fields. Tuple fields are Texpress library items with more than one field or key items.
values            RANGE                    Same as value except used for multi-value text fields.
lower             RANGE                    The minimum value for a partial value (e.g. Jan 2020 has a minimum value of 1st Jan 2020).
lowers            RANGE                    Same as lower except used for multi-value text fields.
upper             RANGE                    The maximum value for a partial value (e.g. Jan 2020 has a maximum value of 31st Jan 2020).
uppers            RANGE                    Same as upper except used for multi-value text fields.
string            HASH (string)            Term-based index where the complete value is indexed as a single term.
strings           HASH (string)            Same as string except used for multi-value text fields.
int               HASH (integer)           Integer number-based index. Supports exact value and range-based searching.
ints              HASH (integer)           Same as int except used for multi-value text fields.
real              HASH (float)             Floating point number-based index. Supports exact value and range based searching.
reals             HASH (float)             Same as real except used for multi-value text fields.
null              NULL                     Index used to search whether a field is empty or non-empty.

Each field in the Texpress schema with indexing enabled has an entry in the Solr schema. For each Texpress index type enabled on the field, a Solr field of the corresponding field type is generated. The name given to the Solr field is the Texpress field name followed by an underscore and the Solr field type. For example, the NamTitle field in the eparties module has the following Texpress index types enabled:

  • HASH (word)
  • STEM
  • NULL

The corresponding Solr field definitions are:

  • NamTitle_text
  • NamTitle_stem
  • NamTitle_null
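
The naming rule can be sketched in a few lines of Python. The dictionary below is transcribed from the mapping table above, and the function name solr_field_name is hypothetical. RANGE is left out because it corresponds to three Solr field types (value, lower and upper). Treating the multi-value form as the single-value name plus "s", and leaving null without a plural, are assumptions read off the table rather than documented behaviour.

```python
# Texpress index type -> Solr field type (single-value form), as listed in
# the mapping table. RANGE is omitted: it spans value, lower and upper.
SOLR_TYPE = {
    "HASH (word)": "text",
    "PHONETIC (word)": "phonetic",
    "STEM (word)": "stem",
    "HASH (string)": "string",
    "HASH (integer)": "int",
    "HASH (float)": "real",
    "NULL": "null",
}


def solr_field_name(field, index_type, multi_valued=False):
    """Build the Solr field name for one Texpress field and index type."""
    solr_type = SOLR_TYPE[index_type]
    # Assumption: multi-value fields use the plural type name (texts, ints);
    # the table lists no plural for null, so it is left unchanged.
    if multi_valued and solr_type != "null":
        solr_type += "s"
    return f"{field}_{solr_type}"


if __name__ == "__main__":
    for index_type in ("HASH (word)", "STEM (word)", "NULL"):
        print(solr_field_name("NamTitle", index_type))
```

Run as a script, this prints the three NamTitle field names listed above.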

When a record is saved, Texpress processes the fields one at a time, generating the terms for each index type enabled on the field. When Solr indexing is enabled, the terms are stored in the Solr field corresponding to the index type. For example, if a record had the value Doctor in the NamTitle field, the following values would be added to each Solr field:

Field           Value
NamTitle_text   doctor
NamTitle_stem   doct
NamTitle_null   false

Texpress converts the term Doctor to lowercase so that searching is case-insensitive by default.

When a search is performed on a table, its indexing type is checked. If Solr indexing is enabled, the query engine generates a Solr query on the required index, supplying the query term. For example, if a search is performed on eparties for records where the Title field (NamTitle) contains the value Doctor, the Solr query NamTitle_text:doctor is generated and sent to Solr for processing. Solr returns a list of the record offsets that match the query. The offsets are added to the list of matching records.
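
A third-party application can issue the same kind of query through Solr's standard select handler. The sketch below only builds the request URL. The host, port and the use of the table name as the Solr core name are placeholders rather than EMu defaults, and build_select_url is a hypothetical helper. It handles a single plain word. Terms containing spaces or Solr query syntax characters would need quoting or escaping, and phonetic or stem indexes hold terms transformed by Texpress, which this sketch does not reproduce.

```python
from urllib.parse import urlencode


def build_select_url(base_url, core, field, solr_type, term):
    """Build a Solr /select URL for a single-term query on one index."""
    # Texpress lowercases terms before indexing, so the query term is
    # lowercased here as well.
    query = f"{field}_{solr_type}:{term.lower()}"
    params = urlencode({"q": query, "wt": "json"})
    return f"{base_url.rstrip('/')}/{core}/select?{params}"


if __name__ == "__main__":
    # Host, port and core name are placeholders, not EMu defaults.
    print(build_select_url("http://localhost:8983/solr", "eparties",
                           "NamTitle", "text", "Doctor"))
```

The resulting URL carries the query NamTitle_text:doctor in URL-encoded form and can be fetched with any HTTP client.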

To reduce indexing overhead, the query terms generated by Texpress are added to the Solr indexes but are not stored as data, so they can be searched but not displayed. This avoids the overhead of saving the indexed values for each record. In addition, the complete data for the record is not stored, only the offset in the Texpress data file where the record is located. Together, these two optimizations greatly reduce the indexing overhead. For example, the index size for a 2GB data file for the eaudit table using Texpress indexing is 3.7GB, while the size for Solr indexing is 630MB (0.6GB). In ratio terms, the Texpress indexes in this case are roughly six times larger than the Solr indexes.
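
The ratio can be checked from the two sizes quoted above. The short calculation below is only a worked example. It computes the ratio under both the decimal (1GB = 1000MB) and binary (1GB = 1024MB) conventions, since the text does not say which applies.

```python
texpress_gb = 3.7   # Texpress index size quoted above
solr_mb = 630       # Solr index size quoted above


def ratio(texpress_gb, solr_mb, mb_per_gb):
    """Return how many times larger the Texpress indexes are."""
    return texpress_gb * mb_per_gb / solr_mb


decimal_ratio = ratio(texpress_gb, solr_mb, 1000)
binary_ratio = ratio(texpress_gb, solr_mb, 1024)
print(f"{decimal_ratio:.2f} {binary_ratio:.2f}")
```

Depending on the convention, the result lands just under or just over six.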

If the solrdata option is enabled, the Solr indexes will also contain a field called _data_. The field is not used by EMu, but is available to third-party applications that search Solr directly. The field value is a string containing a JSON representation of the indexed record. If the _data_ field is to be retrieved, the fl (Field List) parameter in Solr should contain _data_:[json] so that the _data_ string is returned as embedded JSON rather than as an escaped string.
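
The following sketch shows how a third-party application might request the _data_ field and unpack it, assuming the solrdata option is enabled. The host, port, core name and the two helper functions are hypothetical, and the response layout assumed is Solr's standard JSON response (a response object holding a docs list).

```python
import json
from urllib.parse import urlencode


def build_data_url(base_url, core, query):
    """Build a /select URL that asks for _data_ via the [json] transformer."""
    params = urlencode({"q": query, "fl": "_data_:[json]", "wt": "json"})
    return f"{base_url.rstrip('/')}/{core}/select?{params}"


def extract_records(response):
    """Return the _data_ value of each matching document as a dict."""
    records = []
    for doc in response.get("response", {}).get("docs", []):
        data = doc.get("_data_")
        if data is None:
            continue
        # With the [json] transformer the value arrives as embedded JSON;
        # without it, it arrives as a string that still needs parsing.
        records.append(json.loads(data) if isinstance(data, str) else data)
    return records


if __name__ == "__main__":
    # Placeholder host and core; the query reuses the earlier example.
    print(build_data_url("http://localhost:8983/solr", "eparties",
                         "NamTitle_text:doctor"))
```

extract_records accepts the parsed JSON body of the Solr response and tolerates both forms of the field, so it still works if the fl parameter omits the [json] transformer.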

The JSON record generated by the solrdata option contains the full data for the record, and every record is stored. In effect, the EMu security system is bypassed when this option is enabled. Care must be taken when retrieving data to ensure that any institution-based policies are observed before the data is displayed. If third-party access is required where record-level security is observed, the EMu RESTful API should be used.